Statistical models for improving the performance of database operations

ABSTRACT

A method for performing an automatic software-driven statistical evaluation of a large amount of data to be assigned to statistical variables in a database contained in at least one cluster. The method is characterized by using a statistical model to model an approximate description of a relative frequency of the state or states of the statistical variables and the statistical dependencies between the state or states, and then determining the approximate relative frequency of the state or states of the statistical variables and the approximate relative frequency belonging to a predetermined relative frequency of the state or states of the statistical variables and an expected value of the state or states of the statistical variables dependent thereon.

This invention relates to a method for the automatic, software-driven statistical evaluation of large amounts of data that is to be assigned to statistical variables in a database. The data to be evaluated can, in particular, be contained in one or several clusters.

Nowadays databases are able to store immense amounts of data. In order to evaluate the stored data and to extract profitable information from it, efficient, i.e. quick and specific, database accesses are required because of the volume of data.

In general, an evaluation requires finding all the data that conforms to a pre-determinable condition. Often the located data itself need not be known; only statistical information derived from the data is required.

If, for example, in a customer relationship management (CRM) system in which customer data is stored, it is to be determined what proportion of customers with specific features bought a certain product, a simple procedure would be to access all the customer entries in the database, retrieve all the features of the customers, and find and count those entries that “match” the desired features and in which the customers bought the specific product. For example, such a request to the database could be as follows: how often were specific mobile telephones purchased by male customers who are at least 30 years old? For this, all the customer entries that conform to the requirements “male” and “at least 30 years old” must be found, and the matching entries found must then be tested to determine which mobile telephone was purchased the most.
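The simple procedure just described can be sketched as follows. This is only an illustration under stated assumptions: the records, the field names (`sex`, `age`, `phone`) and the helper `purchase_counts` are hypothetical and do not come from any particular CRM system.

```python
from collections import Counter

# Hypothetical customer entries; all field names and values are illustrative.
customers = [
    {"sex": "male", "age": 34, "phone": "model_x"},
    {"sex": "male", "age": 41, "phone": "model_y"},
    {"sex": "female", "age": 37, "phone": "model_x"},
    {"sex": "male", "age": 25, "phone": "model_y"},
    {"sex": "male", "age": 52, "phone": "model_x"},
]


def purchase_counts(records, predicate, target_field):
    """Read every record, keep those matching the condition, count the target field."""
    counts = Counter()
    for record in records:  # full scan: the cost grows with the size of the table
        if predicate(record):
            counts[record[target_field]] += 1
    return counts


# "male" and "at least 30 years old"
counts = purchase_counts(
    customers,
    lambda r: r["sex"] == "male" and r["age"] >= 30,
    "phone",
)
most_purchased = counts.most_common(1)[0][0]
```

Every entry is touched once per request, which is what makes this approach slow for very large databases.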

However, a disadvantage of this procedure is the fact that the entire database has to be read to find the matching entries. This can take a very long time in the case of very large databases.

The database can be searched more skillfully and more efficiently if all the variables are provided with selective indexes that can be queried. As a rule, the more exact and sophisticated the index technique of a database is, the quicker the database can be accessed. Statistical information about the database entries can accordingly also be provided more efficiently. This applies in particular if the database is specifically prepared by a special index technique for the requests to be expected.

Alternatively or in combination with index techniques, the results of all the statistical requests to be expected can be pre-calculated, which has the disadvantage of considerable effort for the calculations and for the storage of the results.

The term “online analytical processing” (OLAP) characterizes a class of methods for extracting statistical information from the data of a database. In general, such methods can be subdivided into “relational online analytical processing” (ROLAP) and “multidimensional online analytical processing” (MOLAP).

The ROLAP method makes only slight pre-calculations. When statistics are requested, the data required for a response to the request is accessed via the index techniques and the statistics are then calculated from the data. The emphasis of ROLAP is therefore on a suitable organization and indexing of the data in order to find and load the required data as quickly as possible. Nevertheless, the effort for large amounts of data can still be very great and, in addition, the selected indexing is sometimes not optimal for all requests.

In the MOLAP method the focus is on pre-calculating the results for many possible requests. As a result, the response time for a pre-calculated request remains very short. For requests that have not been pre-calculated, the pre-calculated values can sometimes also lead to an acceleration if the desired quantities can be calculated from the pre-calculated results and this is more cost-effective than directly accessing the data. The number of all possible requests increases rapidly with the number of variables and with the number of states of these variables, so that the pre-calculation runs up against present limits on storage space and computing time. Restrictions with regard to the variables considered, the different states of these variables or the permissible requests must then be accepted.

Even though the OLAP method guarantees an increase in efficiency compared to merely accessing each database entry, it is disadvantageous that a great amount of redundant information has to be generated: statistics must be pre-calculated and extensive index lists created. In general, an efficient application of an OLAP method also requires that the method is optimized for specific requests, in which case the OLAP method is then also subject to these selected restrictions, i.e. no arbitrary requests can be made to the database.

In addition, it is also true for the OLAP method that the more quickly the information is to be provided and the more this information varies, the more structures must be pre-calculated and stored. Therefore, OLAP systems can become very large and are far less efficient than would be desired; response times of less than one second cannot in practice be achieved for arbitrary statistical requests to a large database. Often the response times are considerably more than one second.

Therefore, there is a need for more efficient methods for the statistical evaluation of data entries. If possible, the requests should not be subject to any restrictions.

The object of this invention is to overcome the disadvantages of the methods known in the prior art, particularly the OLAP method, for the statistical evaluation of database entries.

The methods according to the features of the independent claims achieve this object according to the invention. Advantageous developments of the invention are specified in the subclaims.

According to the invention, a method is shown for the automatic, software-driven statistical evaluation of large amounts of data that is to be assigned to statistical variables in a database, in particular, contained in one or more clusters, which is characterized in that a statistical model for the approximate description of the relative frequencies of the states of the variables and the statistical dependencies between said states is learnt by means of the data stored in the database and is used to determine, on the basis of the statistical model, the approximate relative frequencies of states of the variables, in addition to the approximate relative frequencies belonging to the pre-determinable relative frequencies of states of the variables and expected values of the states of variables dependent thereon.

Unlike the conventional methods for the statistical evaluation of data from databases, the model is not an exact image of the statistics of the data. In general, this procedure yields no exact, but only approximate statistical statements. However, the statistical models are subject to fewer restrictions than, for example, the conventional OLAP methods.

In order to make approximate statistical statements, the entries in a database are “condensed” into a statistical model, the statistical model representing, in effect, an approximation of the joint probability distribution of the database entries. In practice, this takes place by learning the statistical model on the basis of the database entries, so that the relative frequencies of the states of the variables of the database entries can be described approximately. The variables can assume many states with different relative frequencies. As soon as such a statistical model is available, it can be used to study the dependencies between the states of the variables. According to a pre-determinable condition, relative frequencies of the states of the variables can be specified in this way and are used to determine the relative frequencies of states of the variables belonging to predetermined relative frequencies of states of the variables dependent thereon.

A statistical request to the database can in this way be formulated as a condition on the relative frequencies of specific states of the variables, the response to the statistical request being obtained by determining the relative frequencies of states of the variables belonging to predetermined relative frequencies of states of the variables dependent thereon.

As the statistical model, a graphical probabilistic model is preferably used (see e.g.: Castillo, Jose Manuel Gutierrez, Ali S. Hadi, Expert Systems and Probabilistic Network Models, Springer, N.Y.). The graphical probabilistic models particularly include Bayesian networks (or belief networks) and Markov networks.

A statistical model can for example be generated by structure learning in Bayesian networks (see e.g.: Reimar Hofmann, Lernen der Struktur nichtlinearer Abhangigkeiten mit graphischen Modellen (learning the structure of non-linear dependencies with graphical models), Dissertation, Berlin, or David Heckerman, A Tutorial on Learning Bayesian Networks, Technical Report MSR-TR-95-06, Microsoft Research).

A further possibility is to learn the parameters for a fixed structure (see e.g.: Martin A. Tanner: Tools for Statistical Inference, Springer, N.Y. 1996).

Many learning methods use the likelihood function as an optimization criterion for the parameters of the model. A particular embodiment here is the expectation maximization (EM) learning method that is explained below in detail on the basis of a special model. In principle, the generalization ability of the models is not the main concern; it is only necessary to obtain a good fit of the models to the data.

As the statistical model, a statistical clustering model, preferably a Bayesian clustering model, is used by means of which the data is subdivided into many clusters.

Similarly, a clustering model based on a distance measurement can be used together with a statistical model, by means of which the data is likewise subdivided into many clusters.

By using clustering models, a very large database breaks down into smaller clusters that, for their part, can be interpreted as separate databases and can be handled more efficiently because of their comparatively smaller size. Here the statistical evaluation of the database tests whether or not a predetermined condition can be mapped via the statistical model to one or more clusters. Should this be the case, the evaluated data is restricted to one cluster or a number of clusters. Similarly, it is possible to restrict the evaluation to those clusters in which the data conforming to the predetermined condition occurs with at least a specific relative frequency. The remaining clusters, which contain only a smaller amount of data conforming to the predetermined condition, can be ignored because the procedure considered aims only at approximate statements.

For example, a Bayesian clustering model (a model with a discrete latent variable) is used as a statistical clustering model.

This is described in further detail below:

Consider a set of statistical variables $\{A, B, C, D, \ldots\}$ or, in other words, a set of fields in a database table. The corresponding lower-case letters denote the states of the variables. Thus, variable $A$ can assume the states $\{a_1, a_2, \ldots\}$. The states are assumed to be discrete, but in general continuous (real-valued) variables are also permitted.

An entry in the database table consists of values for all the variables, the values belonging to one entry being combined into one data record. For example, $x^{\pi} = (a^{\pi}, b^{\pi}, c^{\pi}, d^{\pi}, \ldots)$ denotes the $\pi$th data record. The table has $M$ entries, i.e. $D = \{x^{\pi}, \pi = 1, \ldots, M\}$.

In addition, there is a hidden variable (cluster variable) that is designated by $\Omega$. The cluster variable can assume the values $\{\omega_i, i = 1, \ldots, N\}$, i.e. there are $N$ clusters.

Here, $P(\Omega|\Theta)$ describes the a priori distribution of the clusters, the a priori weight of the $i$th cluster being given by $P(\omega_i|\Theta)$, and $\Theta$ represents the parameters of the model. The a priori distribution describes what proportion of the data is assigned to each cluster.

The expression $P(A, B, C, D, \ldots|\omega_i, \Theta)$ describes the structure of the $i$th cluster, i.e. the conditional distribution of the variables of the variable set $\{A, B, C, D, \ldots\}$ within the $i$th cluster.

The a priori distribution and the conditional distributions of each cluster thus together parameterize one joint probabilistic model on $\{A, B, C, D, \ldots\} \cup \Omega$ or on $\{A, B, C, D, \ldots\}$. The probabilistic model is given by the product of the a priori distribution and the conditional distribution,

$$P(A, B, C, \ldots, \Omega|\Theta) = P(\Omega|\Theta)\,P(A, B, C, \ldots|\Omega, \Theta),$$

or by

$$P(A, B, C, \ldots|\Theta) = \sum_i P(\omega_i|\Theta)\,P(A, B, C, \ldots|\omega_i, \Theta).$$

The logarithmic likelihood function $L$ of the parameters $\Theta$ for the data set $D$ is now given by

$$L(\Theta) = \log P(D|\Theta) = \sum_{\pi} \log P(x^{\pi}|\Theta).$$

Within the context of the expectation maximization (EM) theory, a sequence of parameters $\Theta^{(t)}$ is now constructed according to the following general rule:

$$\Theta^{(t+1)} = \arg\max_{\Theta} \sum_{\pi} \sum_i P(\omega_i|x^{\pi}, \Theta^{(t)}) \log P(x^{\pi}, \omega_i|\Theta).$$

This iteration rule maximizes the likelihood function step by step.

For the conditional distributions $P(A, B, C, D, \ldots|\omega_i, \Theta)$, restrictive assumptions can (and, in general, must) be made. An example of such a restrictive assumption is the following factorization assumption:

If, for example, the factorization

$$P(A, B, C, D, \ldots|\omega_i, \Theta) = P(A|\omega_i, \Theta)\,P(B|\omega_i, \Theta)\,P(C|\omega_i, \Theta)\,P(D|\omega_i, \Theta) \cdots$$

is assumed for the conditional distribution of the variables of the variable set $\{A, B, C, D, \ldots\}$, the probabilistic model corresponds to a naive Bayesian network. Instead of one high-dimensional table, one is now confronted only with many one-dimensional tables (one table per variable in each case).
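For discrete variables and the factorization assumption, the EM iteration has a closed form: the E-step computes the a posteriori cluster weights of every record, and the M-step re-estimates the a priori weights and the one-dimensional tables from the resulting expected counts. The following is a minimal sketch, not the implementation of the invention. The function name `em_naive_bayes`, the random initialization and the tiny pseudo-count `eps` (used only to avoid zero probabilities) are assumptions made for illustration.

```python
import math
import random


def em_naive_bayes(records, n_clusters, n_iter=30, seed=0, eps=1e-9):
    """EM for a naive Bayesian clustering model over tuples of discrete states."""
    rng = random.Random(seed)
    n_vars = len(records[0])
    states = [sorted({r[v] for r in records}) for v in range(n_vars)]

    def normalise(weights):
        total = sum(weights)
        return [w / total for w in weights]

    prior = [1.0 / n_clusters] * n_clusters
    # cond[i][v][s] approximates P(variable v = s | cluster i);
    # the random start breaks the symmetry between clusters.
    cond = [[dict(zip(states[v], normalise([rng.random() + 0.5 for _ in states[v]])))
             for v in range(n_vars)] for _ in range(n_clusters)]
    history = []  # log-likelihood before each parameter update
    for _ in range(n_iter):
        # E-step: a posteriori cluster weights P(omega_i | x, Theta_t) per record
        resp, loglik = [], 0.0
        for r in records:
            joint = [prior[i] * math.prod(cond[i][v][r[v]] for v in range(n_vars))
                     for i in range(n_clusters)]
            total = sum(joint)
            loglik += math.log(total)
            resp.append([j / total for j in joint])
        history.append(loglik)
        # M-step: re-estimate the parameters from expected counts
        prior = normalise([sum(q[i] for q in resp) for i in range(n_clusters)])
        for i in range(n_clusters):
            for v in range(n_vars):
                counts = {s: eps for s in states[v]}  # eps: assumed smoothing
                for r, q in zip(records, resp):
                    counts[r[v]] += q[i]
                total = sum(counts.values())
                cond[i][v] = {s: c / total for s, c in counts.items()}
    return prior, cond, history


# Toy data set with three variables (values are invented).
records = ([("a1", "b1", "c1")] * 30 + [("a2", "b2", "c2")] * 30
           + [("a1", "b2", "c1")] * 5)
prior, cond, history = em_naive_bayes(records, n_clusters=2)
```

The list `history` records the log-likelihood at each iteration; apart from the negligible effect of the pseudo-count, it does not decrease from one iteration to the next.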

The parameters of the distribution can, as shown above, be learnt from the data with an expectation maximization (EM) learning method. After the learning process, a cluster can be assigned to each data record $x^{\pi} = (a^{\pi}, b^{\pi}, c^{\pi}, d^{\pi}, \ldots)$. The assignment takes place via the a posteriori distribution $P(\Omega|a^{\pi}, b^{\pi}, c^{\pi}, d^{\pi}, \ldots, \Theta)$, the cluster $\omega_i$ with the highest weight $P(\omega_i|a^{\pi}, b^{\pi}, c^{\pi}, d^{\pi}, \ldots, \Theta)$ being assigned to the data record $x^{\pi}$.
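The assignment of a data record to the cluster with the highest a posteriori weight can be sketched as follows for a naive Bayesian clustering model with two variables. The parameter values and the names `prior`, `cond`, `posterior` and `assign_cluster` are hypothetical; in practice the parameters would come from the learning process.

```python
import math

# Hypothetical learnt parameters for two clusters (values are invented).
prior = [0.6, 0.4]  # a priori weights P(omega_i)
cond = [  # cond[i][variable][state] = P(variable = state | omega_i)
    {"A": {"a1": 0.9, "a2": 0.1}, "B": {"b1": 0.8, "b2": 0.2}},
    {"A": {"a1": 0.2, "a2": 0.8}, "B": {"b1": 0.3, "b2": 0.7}},
]


def posterior(record):
    """A posteriori distribution over the clusters for one data record."""
    joint = [p * math.prod(c[var][state] for var, state in record.items())
             for p, c in zip(prior, cond)]
    total = sum(joint)
    return [j / total for j in joint]


def assign_cluster(record):
    """Index of the cluster with the highest a posteriori weight."""
    weights = posterior(record)
    return max(range(len(weights)), key=weights.__getitem__)


first = assign_cluster({"A": "a1", "B": "b1"})
second = assign_cluster({"A": "a2", "B": "b2"})
```

The returned index is the value that could be stored with each entry as its cluster affiliation.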

The cluster affiliation of each entry in the database can be stored as an additional field in the database, and corresponding indexes can be prepared to quickly access the data that belongs to a specific cluster.

If, for example, a statistical request of the type “give all the data records with $A = a_1$ and $B = b_3$ as well as the relevant distributions over $C$ and $D$ (i.e. $P(C|a_1, b_3)$ and $P(D|a_1, b_3)$)” is made to the database, the procedure is as follows:

First of all, the a posteriori distribution $P(\Omega|a_1, b_3)$ is determined. This distribution indicates (approximately) what proportion of the data conforming to the set condition is to be found in which clusters of the database. In this way, all further processing can be restricted, depending on the desired accuracy, to the clusters of the database that have a high a posteriori weight according to $P(\Omega|a_1, b_3)$.

The ideal case is when $P(\omega_i|a_1, b_3) = 1$ applies for one $i$ and accordingly $P(\omega_j|a_1, b_3) = 0$ for all $j \neq i$, i.e. all the data conforming to the set condition lies in one cluster. In such a case, the further evaluation can be restricted to the $i$th cluster without losing accuracy.

In order to obtain (approximate) distributions for $C$ and $D$, it is possible to carry on using the model, i.e. to determine the desired distributions $P(C|a_1, b_3)$ and $P(D|a_1, b_3)$ approximately on the basis of the parameters of the model:

$$P(C|a_1, b_3) \cong \sum_i P(C|\omega_i, a_1, b_3, \Theta)\,P(\omega_i|a_1, b_3, \Theta).$$
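A minimal sketch of this model-based answer is given below, assuming the naive Bayesian factorization, under which $P(C|\omega_i, a_1, b_3, \Theta)$ reduces to $P(C|\omega_i, \Theta)$. All parameter values and the function names `cluster_posterior` and `conditional_distribution` are hypothetical.

```python
import math

# Hypothetical learnt parameters for three clusters (values are invented).
prior = [0.5, 0.3, 0.2]
cond = [
    {"A": {"a1": 0.8, "a2": 0.2}, "B": {"b1": 0.5, "b3": 0.5}, "C": {"c1": 0.9, "c2": 0.1}},
    {"A": {"a1": 0.1, "a2": 0.9}, "B": {"b1": 0.8, "b3": 0.2}, "C": {"c1": 0.2, "c2": 0.8}},
    {"A": {"a1": 0.5, "a2": 0.5}, "B": {"b1": 1.0, "b3": 0.0}, "C": {"c1": 0.5, "c2": 0.5}},
]


def cluster_posterior(prior, cond, evidence):
    """P(omega_i | evidence) for a condition on some of the variables."""
    joint = [p * math.prod(c[var][state] for var, state in evidence.items())
             for p, c in zip(prior, cond)]
    total = sum(joint)
    return [j / total for j in joint]


def conditional_distribution(prior, cond, target, evidence):
    """Approximate P(target | evidence) as a posterior-weighted mixture."""
    post = cluster_posterior(prior, cond, evidence)
    # Naive Bayes assumption: within a cluster the target is independent
    # of the evidence, so P(target | omega_i, evidence) = P(target | omega_i).
    return {s: sum(w * c[target][s] for w, c in zip(post, cond))
            for s in cond[0][target]}


evidence = {"A": "a1", "B": "b3"}
post = cluster_posterior(prior, cond, evidence)
dist_c = conditional_distribution(prior, cond, "C", evidence)
```

No database entry is read to produce `dist_c`; only the model parameters are used, which is why such answers can be given quickly. In this toy model the third cluster receives an a posteriori weight of zero because it contains no data with $B = b_3$.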

Alternatively, however, the model can also be used only to determine the clusters that are relevant for the current request.

After restricting the request to these clusters, more exact methods can be used within the clusters. For example, the statistics within the clusters can be counted exactly (with the help of an additional index referring to the cluster affiliation, or based on the conventional database reporting method or the OLAP method), or further statistical models adapted to the clusters can be used. A tight interlocking with OLAP is particularly advantageous because the so-called “sparsity” of the data in high dimensions is exploited by the statistical clustering models, and the OLAP methods are then used effectively only within the lower-dimensional clusters.

The trade-off between speed and accuracy results from the amount of data excluded from the evaluation: the more clusters are excluded from the evaluation, the quicker, but also the less exact, the response to a statistical request will be. The user can determine the trade-off between accuracy and speed. In addition, more exact methods can be initiated automatically if the accuracy obtained from evaluating the model appears to be insufficient.

In general, clusters that are below a specific minimum weight are excluded from the evaluation. Exact results can be obtained by excluding only such clusters from the evaluation that have an a posteriori weight of zero. Here, an exact “indexing” of the clusters can be reached as a result of an exact indexing of the database, in which case the evaluation is accelerated in many cases. However, in general as many clusters as possible are used for the evaluation.
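The exclusion of clusters below a minimum weight, followed by exact counting within the remaining clusters, might look as follows. This is a sketch under stated assumptions: the a posteriori weights, the cluster index `index` (records grouped by stored cluster affiliation) and the function names are all invented for illustration.

```python
def select_clusters(posterior, min_weight=0.0):
    """Keep clusters whose a posteriori weight exceeds min_weight.

    With min_weight = 0.0 only zero-weight clusters are dropped, so a
    subsequent exact count over the kept clusters loses nothing.
    """
    kept = [i for i, w in enumerate(posterior) if w > min_weight]
    covered = sum(posterior[i] for i in kept)  # share of matching data still covered
    return kept, covered


def count_in_clusters(index, kept, predicate):
    """Exact count of matching records, reading only the kept clusters."""
    return sum(1 for i in kept for record in index.get(i, []) if predicate(record))


# Hypothetical a posteriori weights for one request and a toy cluster index.
posterior = [0.90, 0.07, 0.03, 0.0]
index = {
    0: [{"A": "a1", "B": "b3"}, {"A": "a1", "B": "b3"}, {"A": "a2", "B": "b1"}],
    1: [{"A": "a1", "B": "b3"}],
    2: [{"A": "a1", "B": "b3"}],
    3: [{"A": "a2", "B": "b2"}],
}


def matches(record):
    return record["A"] == "a1" and record["B"] == "b3"


kept_exact, covered_exact = select_clusters(posterior, min_weight=0.0)
kept_fast, covered_fast = select_clusters(posterior, min_weight=0.05)
exact_count = count_in_clusters(index, kept_exact, matches)
fast_count = count_in_clusters(index, kept_fast, matches)
```

Raising `min_weight` reads fewer clusters and therefore answers faster, at the price of missing the matching records that lie in the excluded clusters.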

Overtraining a clustering model is of no importance here, because the aim is to reproduce the historical data as exactly as possible and not to make a prognosis for the future. On the contrary, intensely overtrained clustering models tend to supply the most unambiguous possible assignment of requests to clusters, which means that in further operations the request can very quickly be limited to small clusters of the database.

Advantageously, the data belonging to a cluster is stored on a data carrier in a way appropriate to the cluster affiliation. For example, the data belonging to one cluster can be stored on one section of the hard disk so that the data belonging together can be read more quickly as a block.

As has already been shown, according to the method of the invention conventional methods for the statistical evaluation of data from databases can also be used in a supplementary way if approximate statements are deemed to be insufficient. In particular, conventional database reporting or OLAP methods are used to determine the relative frequencies of the states of the variables.

A supplementary application of conventional database techniques can for example be initiated automatically if a definable test variable reaches or exceeds a predetermined value.
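The text leaves the test variable open. Purely as an assumption for illustration, the sketch below uses the a posteriori weight of the clusters excluded from the evaluation as the test variable; the function `answer_request`, its parameters and the threshold values are hypothetical.

```python
def answer_request(posterior, approximate_answer, exact_answer,
                   min_weight=0.05, max_excluded_mass=0.02):
    """Fall back to a conventional (exact) evaluation when the test variable is too large.

    Assumed test variable: total a posteriori weight of the clusters that
    would be excluded because they do not exceed min_weight.
    """
    excluded_mass = sum(w for w in posterior if w <= min_weight)
    if excluded_mass >= max_excluded_mass:  # reaches or exceeds the threshold
        return exact_answer(), "exact"
    return approximate_answer(), "approximate"


# The two callables stand in for the model-based and the conventional evaluation.
result_a = answer_request([0.97, 0.02, 0.01], lambda: 0.88, lambda: 0.90)
result_b = answer_request([0.995, 0.005], lambda: 0.88, lambda: 0.90)
```

Passing the evaluations as callables means the slower conventional evaluation is executed only when the fallback is actually triggered.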

According to the invention, a method is shown for the automatic, software-driven statistical evaluation of large amounts of data that is to be assigned to statistical variables in a database, in particular, contained in one or several clusters, which is characterized in that the data is subdivided into many clusters by a clustering model based on distance measurement and, if required, the considered data is restricted to the data contained in one cluster or several clusters, in which case the database reporting methods or the OLAP methods are used to determine the relative frequencies and expected values of the states of variables.

The methods shown in the invention can subdivide the data of the database into clusters as well as, if required, result in a restriction to one cluster or several clusters. If the methods according to the invention are used for data that is already contained in one cluster or several clusters, the clusters are in this way subdivided into subclusters. If the evaluation is to be restricted to one or more subclusters, the methods according to the invention can be used for the data contained therein, in which case, if required, more exactly adapted statistical models can be used. In general, this procedure can be repeated as often as desired, i.e. the clusters can be subdivided into subclusters, the subclusters into sub-subclusters, etc., and, if required, the evaluation can in each case be restricted to the data contained therein and the methods according to the invention can be used (adapted more exactly) for the data contained in the clusters considered.

An embodiment of the invention in the Web reporting/Web mining area is described below with reference to the accompanying drawings.

FIG. 1 shows different monitor windows in which variables for describing the visitors to a Web site are displayed.

FIG. 2 shows different monitor windows of the variables of FIG. 1, in which case the behavior of visitors from a specific referrer is investigated.

FIG. 3 shows different monitor windows of the variables of FIG. 1, in which case the behavior of visitors that call up the homepage first, then read the news and subsequently call up the homepage again is investigated.

In general, large amounts of data have to be evaluated in the Web reporting/Web mining area. When a user visits a Web site, each action of the user is usually recorded in the Web log file. This is very data-intensive because such Web log files can grow very rapidly to sizes in the region of several gigabytes.

In order to prepare the evaluation of the Web log files, “sessions” or visits by visitors were extracted, i.e. all the successive entries (page retrievals or clicks) belonging to a visitor were combined.

Each session by a visitor was characterized by a set of different variables, namely in particular “start time”, “session duration”, “number of requests”, “referrer”, “1st visited category”, “2nd visited category”, “3rd visited category” and “4th visited category”.

In addition, further variables (not shown in the figures) were specified, such as “does the visitor accept cookies”, “number of sessions that the visitor had already had up to the current session”, “number of pages retrieved in the last session”, “interval in time to the last session”, “on which page did the last session end”, “time of the first session by the visitor” and “weekday”.

Altogether, each session was characterized in this way on the basis of 18 different variables.

In order to determine the relative frequencies of the states of the variables, a naive Bayesian clustering model, as described above, was used.

For this purpose, the specified variables were integrated in the statistical model. The statistical model was then trained on the data contained in the Web log files to find good parameters for the model. The desired relative frequencies can then be read from the model.

The result of determining the relative frequencies of the states of the variables is displayed in FIG. 1. FIG. 1 shows different monitor windows in which the variables “start time”, “session duration”, “number of requests”, “referrer”, “1st visited category”, “2nd visited category”, “3rd visited category” and “4th visited category” to describe the visitors to a Web site are shown.

From FIG. 1 it can be seen in particular that

-   approximately 55% of the visitors visit the Web site during the afternoon or evening,
-   approximately 47% of the visitors only remain less than 1 minute on the Web site,
-   approximately 34% of the visitors only start one request,
-   approximately 56% of the visitors do not have a referrer,
-   approximately 45% of the visitors start on the homepage, and
-   approximately 57% of the visitors only visit 1 category, approximately 74% of the visitors only 2 categories and approximately 85% of the visitors only 3 categories.

After the statistical model was trained on the basis of an EM learning method, the dependencies between the variables could also be studied.

As can be seen in FIG. 2, the behavior of, for example, those visitors that came from a specific referrer (referred to as Endemann below) was investigated. For this, the corresponding entry in the variable “referrer” was set at 100%. By using the statistical model, it could be determined within fractions of a second that approximately 99% of these visitors first visit the homepage and that the predominant majority (approximately 96%) then immediately leave the Web site again.

FIG. 3 displays a complicated request to the database. FIG. 3 shows different monitor windows of the variables to be considered, in which case the behavior of the visitors that call up the homepage first, then read the news and subsequently call up the homepage again is investigated. Here the corresponding entries in the variables “1st visited category”, “2nd visited category” and “3rd visited category” were set at 100%.

Again, it could be determined by means of the statistical model within fractions of a second that these visitors then predominantly either read the news again (approximately 37%) or left the Web site (approximately 36%). It can also be seen in FIG. 3 that approximately 89% of these visitors have no referrer.

In a corresponding way, a response could be given to a multitude of further requests to the database within a short period, i.e. in general within less than 1 second. For example, it could be tested what proportion of the visitors that come from a specific referrer makes more than three page requests, how these visitors are distributed over the time of day and which of these visitors are returning visitors. It could also be tested how the traffic of those visitors starting with the homepage is distributed, i.e. what proportion of the visitors continues or subsequently aborts the session, and in which way.

Such a multitude of requests with many different variables can, for the same amount of data, be handled considerably more efficiently with the method according to the invention than with the conventional database techniques, particularly the OLAP methods. Similarly, conventional OLAP methods can also be used in addition if exact statements are to supplement the approximate statements gained by the statistical model. However, considerably longer response times must then be accepted.

To summarize, this invention, as opposed to the conventional database techniques, particularly the database reporting and OLAP methods, can answer statistical requests made to extensive databases approximately, and more efficiently, by using statistical models. This does not exclude the additional use of conventional techniques for evaluating databases in order to obtain exact statements, if required. By using a clustering model by means of which the database is broken up into smaller clusters, requests can very quickly be restricted (approximately or exactly) to the relevant clusters of a database. Once the request has been restricted to clusters of the database, a renewed statistical evaluation of these clusters can be carried out according to the invention, in the course of which, if required, a renewed restriction to the subclusters contained in these clusters, as well as a renewed statistical evaluation of the data contained in the subclusters, can be made. In general, this procedure can be repeated as often as desired. In this way statistics can be created, or statistical requests answered, more efficiently.

Similarly, according to the invention, a clustering model based on a distance measurement can be used to subdivide the data of a database into many clusters, in which case the evaluation is restricted to the relevant clusters of the database. In order to determine the relative frequencies and expected values of the states of variables, conventional database reporting methods or OLAP methods are then used.

In principle, this invention can be used wherever an efficient statistical evaluation of large amounts of data is required.

A possible application is therefore in the Web reporting/Web mining area, as has already been shown in the embodiment.

Further possible applications can be found, for example, wherever customer data or other data is obtained in large amounts, such as:

-   data from call centers,
-   data from operational customer relationship management systems,
-   data from the health area,
-   data from medical databases,
-   data from environmental databases,
-   data from genome databases,
-   data from the financial area.

1. Method for the automatic, software-driven statistical evaluation of large amounts of data that is to be assigned to statistical variables in a database, in particular, contained in one or several clusters, which is characterized in that a statistical model for the approximate description of the relative frequencies of the states of the variables and the statistical dependencies between said states is learnt by means of the data stored in the database and is used to determine, on the basis of the statistical model, the approximate relative frequencies of states of the variables, in addition to the approximate relative frequencies belonging to the pre-determinable relative frequencies of states of the variables and expected values of the states of variables dependent thereon.
2. Method according to claim 1, characterized in that as the statistical model, a graphical probabilistic model, in particular a Bayesian network, is used.
3. Method according to claim 1, characterized in that a statistical clustering model, in particular a Bayesian clustering model, is used by means of which the data is subdivided into many clusters.
4. Method according to claim 1, characterized in that a clustering model based on a distance measurement is likewise used by means of which the data is subdivided into a plurality of clusters.
5. Method according to claim 3 or 4, characterized in that the considered data is restricted to the data contained in one cluster or a number of clusters.
6. Method according to claim 5, characterized in that it is possible that such clusters are restricted in which the data belonging to the specific states of variables contains at least one specific relative frequency.
7. Method according to one of the claims 4 to 6, characterized in that the data belonging to a cluster is stored on a data carrier in a way appropriate to the cluster affiliation.
8. Method according to one of the previous claims, characterized in that database reporting methods or OLAP methods are further used to determine the relative frequencies and expected values of the states of variables.
9. Method according to claim 8, characterized in that database reporting methods or OLAP methods are used if a test variable assumes or exceeds a predetermined value.
10. Method for the automatic, software-driven statistical evaluation of large amounts of data that is to be assigned to statistical variables in a database, in particular, contained in one or several clusters, which is characterized in that the data is subdivided into many clusters by a clustering model based on distance measurement and, if required, the considered data is restricted to the data contained in one cluster or several clusters, and database reporting methods or the OLAP methods are used to determine the relative frequencies and expected values of the states of variables.
11. Application of the method according to one of the previous claims for the statistical evaluation of customer data, in particular, in the Web reporting/Web mining area and in customer relationship management systems.
12. Application of the method according to one of the previous claims for the statistical evaluation of environmental databases, medical databases or genome databases.